### Basic Questions:

1. What are the distributions of the number of passengers per trip, payment type, fare amount, tip amount, and total amount?
2. What are the top 5 busiest hours of the day, and the top 10 busiest locations of the city?
3. What is the hourly taxi activity for each day of the week?
4. Which trip has the most consistent fares?

### Open Questions:

1. Can you predict the fare and tip amount based on the pickup / drop-off location, time, and day of the week?
2. Can you predict the pickup / drop-off geographical distribution for each hour of a weekday?
3. If you were a taxi owner, how would you maximize your earnings in a day?
4. If you run a taxi company, how would you maximize your earnings?
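The busiest-hours and hourly-activity questions can be approached by deriving an hour and a weekday from the pickup timestamp and counting trips. The following is a minimal sketch on a handful of made-up pickups, not the notebook's own analysis; the column name `tpep_pickup_datetime` is an assumption about the CSV header, and `trips`, `busiest_hours` and `hourly_by_weekday` are illustrative names.

```python
import pandas as pd

# Made-up pickups; the column name tpep_pickup_datetime is an assumption
# about the header of the trip CSV, not something confirmed in this notebook.
trips = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime([
        '2016-01-04 08:15:00',  # Monday
        '2016-01-04 08:45:00',  # Monday
        '2016-01-04 18:05:00',  # Monday
        '2016-01-09 23:30:00',  # Saturday
        '2016-01-09 23:50:00',  # Saturday
        '2016-01-10 08:10:00',  # Sunday
    ])
})

pickup = trips['tpep_pickup_datetime']
trips['hour'] = pickup.dt.hour
trips['weekday'] = pickup.dt.day_name()

# Trips per hour of day, busiest first (top 5).
busiest_hours = trips['hour'].value_counts().head(5)

# Weekday x hour table of trip counts.
hourly_by_weekday = pd.crosstab(trips['weekday'], trips['hour'])

print(busiest_hours)
print(hourly_by_weekday)
```

On the full dataset the same two steps would apply after parsing the timestamp column with `pd.to_datetime`; the crosstab can then be drawn as a heatmap or as one line per weekday.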
```python
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# plotly.plotly is the plotly < 4 online API; in plotly >= 4 it lives in the
# separate chart_studio package (chart_studio.plotly)
import plotly.plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly import tools

# initiate the Plotly notebook mode
init_notebook_mode()

df_big = pd.read_csv('yellow_tripdata_2016-01.csv')
#df_big_clean=df_big.fillna(df_big.mean())
#df_big.dropna(axis=1)
df_big_clean = df_big
# R equivalent kept for reference:
#df_big_clean <- df_big[!(is.na(df$start_pc) | df$start_pc==""), ]
# | is an or-operator and ! inverts. Hence, the command above displays all rows
# which are not a) NA or b) equal to ""
df = df_big_clean.loc[0:10000, :]  # use a reduced sample while testing
print(df_big.shape)
print(df_big_clean.shape)
df
#help(plotly.offline.iplot)
```

## Insight 1: Passenger numbers

* Most NY Taxi trips transport solo passengers

```python
import numpy as np
import plotly.plotly as py
#import plotly.offline as offline
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
import plotly.graph_objs as go
init_notebook_mode()

# extract number of people per trip
peps_per_trip_df = df.loc[:, df.columns.str.match('passenger_count')]
peps_per_trip_df.shape
#print(type(peps_per_trip_df))
# select the column by name so the result is a 1-D array
# (the boolean column mask above yields a 2-D array of shape (n, 1))
peps_per_trip = df['passenger_count'].values
#print(type(peps_per_trip))

data = [go.Histogram(x=peps_per_trip)]  # or [dataset1, dataset2]
layout = go.Layout(
    title='Histogram of Passenger numbers',
    xaxis=dict(title='passenger number'),
    yaxis=dict(title='Count'),
    bargap=0.2,
    bargroupgap=0.1
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='People_per_trip_histogram')  # online mode; limit of 50 plots/day on a community account
#iplot(fig, filename='People_per_trip_histogram')  # offline mode; no limit
```

## Insight 2: cash versus credit

* New Yorkers prefer to pay with credit card (60:40 ratio in favour of credit card)
* Cash usage is considerable at 40%. The cash option is a point of difference over competitor Uber.
* Distribution of fares is similar across cash and credit card payments (median credit card fare is \$1 higher than cash fare)
* Peak at \$52 is likely to represent Manhattan -> JFK airport trips (this has a flat rate fee of \$52, source: Wikipedia)
* NY taxi fares are cheap (compared to Melbourne!). Median fare is around \$10
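The credit/cash split and the median fares quoted above can be cross-checked with `value_counts` and `groupby`. This is a minimal sketch on a few made-up rows, assuming the same `payment_type` coding the notebook uses (1 = credit card, 2 = cash); `toy`, `paid`, `share_pct` and `median_fare` are illustrative names only.

```python
import pandas as pd

# Made-up rows; payment_type follows the notebook's coding (1 = credit card, 2 = cash).
toy = pd.DataFrame({
    'payment_type': [1, 1, 1, 2, 2, 3],
    'fare_amount':  [12.0, 9.0, 52.0, 8.0, 10.0, 0.0],
})

# Keep only credit-card and cash trips, as the notebook does.
paid = toy[toy['payment_type'].isin([1, 2])]

# Share of each payment type, in percent.
share_pct = (paid['payment_type'].value_counts(normalize=True) * 100).round(1)

# Median fare per payment type.
median_fare = paid.groupby('payment_type')['fare_amount'].median()

print(share_pct)
print(median_fare)
```

Counting rows this way avoids relying on the numeric value of the payment code, and it extends to the other payment types without extra arithmetic.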
```python
# Distribution: Payment by type
# extract fares by payment type
# 1=cc, 2=cash, 3=no charge, 4=dispute, 5=unknown, 6=voided trip
fare_paymenttype1 = df.loc[df['payment_type'] == 1, 'fare_amount'].values  # credit card
fare_paymenttype2 = df.loc[df['payment_type'] == 2, 'fare_amount'].values  # cash
#fare_paymenttype4=df.loc[df['payment_type'] == 4, 'fare_amount'].values #dispute
fare_payments = np.append(fare_paymenttype1, fare_paymenttype2)

total_paymentstype1 = df.loc[df['payment_type'] == 1, 'total_amount'].values  # fare+tips+tolls
total_paymentstype2 = df.loc[df['payment_type'] == 2, 'total_amount'].values  # fare+tips+tolls
tip_amountstype1 = df.loc[df['payment_type'] == 1, 'tip_amount'].values  # tips, credit card only (cash tips are not recorded)
total_payments = np.append(total_paymentstype1, total_paymentstype2)

# count trips per payment type (count rows rather than summing the type codes)
numberofCCpays = (df['payment_type'] == 1).sum()
numberofCashpays = (df['payment_type'] == 2).sum()
PcentofCCpays = np.round(numberofCCpays*100/(numberofCashpays+numberofCCpays), decimals=1)
#print(PcentofCCpays)
PcentofCashpays = np.round(numberofCashpays*100/(numberofCashpays+numberofCCpays), decimals=1)
#print(PcentofCashpays)
#print(type(fare_paymenttype2[1:10]))

# Group data together
hist_data = [fare_paymenttype1, fare_paymenttype2]
find_median1 = np.median(fare_paymenttype1)
find_median2 = np.median(fare_paymenttype2)
#print(find_median)
group_labels = ['Credit card', 'Cash']

# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=1.0)
fig.layout.update({'title': 'Distribution of Fares'})
fig.layout.xaxis1.update({'title': '$ amounts'})

# Plot!
#py.iplot(fig, filename='Distplot with Multiple Datasets') #online plot mode
iplot(fig, filename='Distplot with Multiple Datasets')  # offline mode

from IPython.display import display, Math, Latex
display(Math(r'\text{Percentage of credit card payments is } %s \text{%%}' % PcentofCCpays))
display(Math(r'\text{Median credit payment is \$} %s ' % find_median1))
display(Math(r'\text{Percentage of cash payments is } %s \text{%%}' % PcentofCashpays))
display(Math(r'\text{Median cash payment is \$} %s' % find_median2))
```

## Insight 3: fare breakdown

* Median tip (credit card data only) is 20% of the fare

```python
# Group data together
hist_data2 = [fare_payments, total_payments, tip_amountstype1]
group_labels2 = ['Fare', 'Total Charge', 'Tip Amount']

# Create distplot with custom bin_size
fig2 = ff.create_distplot(hist_data2, group_labels2, bin_size=[0.5, 0.5, 0.4])
fig2.layout.update({'title': 'Breakdown & Distribution of NY Taxi Fares'})
fig2.layout.xaxis1.update({'title': '$ amounts'})

# Plot!
#py.iplot(fig2, filename='Distplot with Multiple Datasets2') # online plot option
iplot(fig2, filename='Distplot with Multiple Datasets2')  # offline plot option

# median tip as a percentage of the median credit card fare
find_mediantip = np.median(tip_amountstype1)
Med_tip_percentage = np.round(find_mediantip*100/find_median1, decimals=1)
display(Math(r'\text{Median tip payment (Credit card payment data only) is \$} %s ' % find_mediantip))
display(Math(r'\text{Median tip percentage (Credit card payment data only) is } %s \text{%%}' % Med_tip_percentage))
```

## Insight 4: Pick up and Drop off locations

* Manhattan (central business zone) is the busiest area for taxi use
* Airports (LaGuardia and JFK) feature strongly in usage maps
    * Curiously, people get dropped off at the airports at very fixed locations, while pick-up locations are more diffuse
    * Is there a culture of people wandering out from the airport and hailing taxis from wherever, with no easy-to-locate taxi ranks?
* People **start taxi journeys** most frequently:
    1. in Manhattan on the **main streets**
    2. on the **main arterial routes** within residential areas (Brooklyn, Queens)
    * The *Sex And The City* imagery of hailing taxis on demand from busy streets is backed up by the data
* People **end taxi journeys** most frequently:
    1. again in Manhattan, both on main streets and off the main streets
    2. at very **diffuse locations** across residential areas (Brooklyn, Queens, The Bronx)
    * The Bronx is a frequent drop-off location, but rarely a pick-up location
    * An effect of green "boro taxis" since 2013? (Note, however, that boroughs where green taxis can be hailed include The Bronx, Queens and Brooklyn, yet the Bronx taxi pattern is notably different)

```python
# Map the pick up locations
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rcParams

df = df_big
#pd.options.display.mpl_style = 'default' #Better Styling
plt.style.use('ggplot')
new_style = {'grid': False}  # grid off
matplotlib.rc('axes', **new_style)
rcParams['figure.figsize'] = (12, 12)  # size of figure
rcParams['figure.dpi'] = 250

P = df.plot(kind='scatter', x='pickup_longitude', y='pickup_latitude', color='white',
            xlim=(-74.06, -73.77), ylim=(40.61, 40.91), s=.02, alpha=.3)
#P.set_axis_bgcolor('black') #Background Color
P.set_facecolor('black')  # background colour
#plt.show()
```

```python
# Map the drop off locations
df = df_big
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rcParams
## Inline plotting for Jupyter Notebook
#%matplotlib inline
#pd.options.display.mpl_style = 'default' #Better Styling
plt.style.use('ggplot')
new_style = {'grid': False}  # grid off
matplotlib.rc('axes', **new_style)
rcParams['figure.figsize'] = (12, 12)  # size of figure
rcParams['figure.dpi'] = 250

# s is the marker size and alpha is the opacity
P = df.plot(kind='scatter', x='dropoff_longitude', y='dropoff_latitude', color='white',
            xlim=(-74.06, -73.77), ylim=(40.61, 40.91), s=.02, alpha=.3)
P.set_facecolor('black')  # background colour
plt.show()
```

```python
# Top 10 busiest locations of the city
#import reverse_geocoder as rg
from geopy.geocoders import Nominatim

df = df_big
# Snap each pickup to a 0.01-degree grid box and label it with the box centre.
# Flooring (not rounding) keeps the +0.005 shift inside the box the point fell in.
#Latitude_round=df.loc[df['payment_type'] == 1, 'fare_amount'].values
Latitude_round = np.round(np.floor(df['pickup_latitude'].values*100)/100 + 0.005, 3)
Longitude_round = np.round(np.floor(df['pickup_longitude'].values*100)/100 + 0.005, 3)
#print(Latitude_round[0:5])
#print(Longitude_round[0:5])
df.loc[:, 'GridcodeLat'] = pd.Series(Latitude_round, index=df.index)  # add grid code columns to df
df.loc[:, 'GridcodeLon'] = pd.Series(Longitude_round, index=df.index)

# find the locations with the most common grid codes
mytable = df.groupby(['GridcodeLat', 'GridcodeLon']).size()
mytable.sort_values(inplace=True, ascending=False)
totaltrips = mytable.sum()
print('Total trips')
print(totaltrips)

# note: despite the name, this keeps the 30 busiest grid boxes
Top10BusyPickupLocations = mytable.head(30)
#print(Top10BusyPickupLocations)
#print(type(Top10BusyPickupLocations))
Top10BusyPickupLocations = Top10BusyPickupLocations.to_frame()
print(Top10BusyPickupLocations)
print(type(Top10BusyPickupLocations))

#coordinates = (51.5214588,-0.1729636),(9.936033, 76.259952),(37.38605,-122.08385)
coordinates = Top10BusyPickupLocations.index.values.tolist()  # list of (lat, lon) tuples
print(coordinates)
type(coordinates)
#results = rg.search(coordinates) # default mode = 2, reverse geocode from lat and long to address
#print(results)

# geopy >= 2.0 requires an explicit user_agent; the name used here is a placeholder
geolocator = Nominatim(user_agent='ny-taxi-notebook')
#locations = geolocator.reverse("40.755, -73.985")
for lat_lon in coordinates:
    try:
        location = geolocator.reverse(lat_lon)
        #print(location)
        PlaceNames = location.address.split(",")
    except Exception:
        # lookup failed or returned nothing: fall back to placeholders
        PlaceNames = ['Unknown'] * 8
    print(PlaceNames[-8:-5])

#df1.loc[:,'f'] = p.Series(np.random.randn(sLength), index=df1.index) #add column f to df1
#plot table or pie chart
#Top10BusyPickupLocations['GridcodeLat','GridcodeLon'].values
Top10BusyPickupLocations.index.values
coordinates[2]
```

```python
# plot pie chart of Top 10 busiest locations
# Add graph data
# note: labels and values are hard-coded, not derived from Top10BusyPickupLocations
trace1 = {'labels': ['1st', '2nd', '3rd', '4th', '5th'],
          'values': [38, 27, 18, 10, 7],
          'type': 'pie',
          'name': 'Starry Night',
          'marker': {'colors': ['rgb(56, 75, 126)',
                                'rgb(18, 36, 37)',
                                'rgb(34, 53, 101)',
                                'rgb(36, 55, 57)',
                                'rgb(6, 4, 4)']},
          'domain': {'x': [0, 1],
                     'y': [.4, 1]},
          'hoverinfo': 'label+percent+name',
          'textinfo': 'none'
          }

# The original cell extended a `figure` object that is never created in this
# notebook; here the pie trace gets a figure of its own so the cell runs.
figure = go.Figure(data=[trace1])

# Update the margins to add a title
figure.layout.margin.update({'t': 75, 'l': 50})
figure.layout.update({'title': 'Starry Night'})
figure.layout.update({'height': 800})

# Plot!
py.iplot(figure)

# classify into Manhattan, JFK airport, LaGuardia
# Q: what percentage are those airport trips?
# map the fare disputes / scrap as not many of these
# find out % of trips paid by cc versus cash
# insights: lots of drop-offs to Brooklyn, Queens, Bronx; fewer pick-ups from these areas.
# People get taxis home rather than to work?
# time of day? weekend? And people seem to get picked up from main streets!
# (the Sex and the City iconography of hailing a cab is true!)
# interesting in times of Uber
```

```python
# plot Distribution: Passenger numbers per trip
import numpy as np
import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.figure_factory as ff
import plotly.graph_objs as go

#peps_per_triprav = peps_per_trip.ravel()
print(peps_per_trip)
# the histogram above works, but the plotly distplot is not happy with the 2-D array
#print(df.shape)
#df = df.replace('[]', np.nan) #Soln (a) replace all elements that have any empty value with NaN values
#df=df.dropna() #Soln (b) drop all rows that have any NaN values
#print(df.shape)

# select the column by name so create_distplot receives a 1-D array
peps_per_trip = df['passenger_count'].values
hist_data = [peps_per_trip]
group_labels = ['distplot']
#plt.plot(peps_per_trip)
#plt.show()
fig = ff.create_distplot(hist_data, group_labels)
fig['layout'].update(title='Distribution: Passenger numbers per trip')
py.iplot(fig, filename='DistplotPepsPerTrip')

#import plotly
#plotly.tools.set_credentials_file(username='eosg', api_key='AmlsmkQM0FkVbEPtlQSf')
#plotly.tools.set_credentials_file(username='elmao', api_key='8z69RhuTfVA7EdkIEtXZ')
```

## If you run a taxi company, how would you maximize your earnings?

Uber is a major market disrupter in the taxi space. To maximise taxi company earnings, concurrent analysis of Uber versus taxi data is necessary.

Thoughts: On cold NY winter mornings (or in the rain!), does Uber now take a big share of the historical taxi market (direct pick-up from the door rather than walking to a major route to hail a taxi)?

* UberT has entered the market gap here (you can request a yellow taxi to your door through the Uber app)
```python
# basic histograms
# extract number of people per trip
peps_per_trip_df = df.loc[:, df.columns.str.match('passenger_count')]
peps_per_trip_df.shape
print(type(peps_per_trip_df))
peps_per_trip = df.loc[:, df.columns.str.match('passenger_count')].values
print(type(peps_per_trip))

# fares by payment type
fare_paymenttype1 = df.loc[df['payment_type'] == 1, 'fare_amount'].values
fare_paymenttype2 = df.loc[df['payment_type'] == 2, 'fare_amount'].values
fare_paymenttype4 = df.loc[df['payment_type'] == 4, 'fare_amount'].values
type(fare_paymenttype1)

# payment type: 1=cc, 2=cash, 3=no charge, 4=dispute, 5=unknown, 6=voided trip
# rate code ID (final rate code at end of the trip): 1=standard rate, 2=JFK, 3=Newark,
# 4=Nassau or Westchester, 5=Negotiated fare, 6=Group ride
```
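The numeric payment codes listed in the comment above are easier to read in tables and plot legends once mapped to labels. A minimal sketch, assuming the coding given in that comment; `PAYMENT_LABELS`, `toy_codes` and `label_counts` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Code-to-label mapping taken from the comment in the notebook cell above.
PAYMENT_LABELS = {
    1: 'credit card',
    2: 'cash',
    3: 'no charge',
    4: 'dispute',
    5: 'unknown',
    6: 'voided trip',
}

# Made-up payment codes standing in for df['payment_type'].
toy_codes = pd.DataFrame({'payment_type': [1, 2, 2, 4, 1]})

# Series.map replaces each code with its label (unmapped codes would become NaN).
toy_codes['payment_label'] = toy_codes['payment_type'].map(PAYMENT_LABELS)

# Trips per labelled payment type.
label_counts = toy_codes['payment_label'].value_counts()
print(label_counts)
```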